Clustering 



Cluster Analysis 

• Partitioning a set of data objects into subsets or clusters 

• Objects in a cluster are similar to one another, yet dissimilar to objects in other clusters


• Goal: discovery of previously unknown groups within the data 

• Clusters are implicit classes

• Applications -> business intelligence, image pattern recognition, web search, 
biology, security 


• Clustering can be used for preprocessing and outlier detection 

Requirements for Cluster Analysis 

• Scalability -> currently handles small datasets, uses sampling 


• Handling different attribute types -> mostly numerical 

• Discovering clusters with arbitrary shape -> currently mostly spherical

• Domain knowledge & input parameters -> # clusters & clustering results 


• Handling noisy data -> currently sensitive to noise 




• Incremental clustering & insensitivity to input order -> currently, new data requires re-computing clusters from scratch, and results are sensitive to input order

• Handling high-dimensional data -> currently mostly low dimensionality

• Constraint-based clustering -> little support for domain constraints 


• Interpretability & usability -> are results comprehensible & usable? 


Comparing Cluster Analysis Methods 

• The partitioning criteria - flat (a single level) or hierarchical (multiple levels)?

• Separation of clusters - mutually exclusive or overlapping? 


• Similarity measure - distance or connectivity/density? 

Overview of Cluster Analysis Methods 
1. Partitioning 



2. Hierarchical 




3. Density-based



4. Grid-based
Overview of Cluster Analysis Methods 



Method                   Characteristics

Partitioning methods     - Find mutually exclusive clusters of spherical shape
                         - Distance-based
                         - May use mean or medoid to represent cluster center
                         - Effective for small- to medium-size data sets

Hierarchical methods     - Clustering is a hierarchy involving multiple levels
                         - Cannot correct erroneous merges/splits
                         - May consider object "linkages"

Density-based methods    - Can find arbitrarily shaped clusters
                         - Clusters are dense regions separated by low-density regions
                         - Each point must have a minimum number of points within its "neighborhood"
                         - May filter out outliers

Grid-based methods       - Use a multi-resolution grid data structure
                         - Fast processing time



Partitioning Methods 

1. K-Means - A Centroid-Based Technique 

• Divide dataset into k mutually exclusive clusters 


• Clusters are represented by their centroids

• A centroid is a cluster's center point

• In k-means -> the centroid is the mean of the points within the cluster

• Each object x in a cluster has a distance from the centroid c_i -> dist(x, c_i)

• x is assigned to the most similar cluster -> C_i with min dist(x, c_i)

• Cluster means are updated, then assignment is repeated 

• To measure cluster quality -> minimize sum of squared errors 



E = \sum_{i=1}^{k} \sum_{x \in C_i} dist(x, c_i)^2



Factors to consider: 

• Selection of k 

• Selection of initial centroids

• Calculation of dissimilarity 

• Calculation of cluster means 
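The assign/update loop described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not a reference implementation: the function names (`dist`, `mean`, `k_means`), the random choice of initial centroids, the fixed `seed`, and the toy data are all assumptions made for the example.

```python
import math
import random


def dist(x, c):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, c)))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def k_means(points, k, max_iter=100, seed=0):
    """Return (centroids, labels, sse) for a plain k-means run."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumed init: k random objects
    labels = None
    for _ in range(max_iter):
        # Assignment step: each x goes to the cluster with min dist(x, c_i).
        new_labels = [min(range(k), key=lambda i: dist(x, centroids[i]))
                      for x in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: recompute each centroid as the mean of its members.
        for i in range(k):
            members = [x for x, lab in zip(points, labels) if lab == i]
            if members:  # keep the old centroid if a cluster went empty
                centroids[i] = mean(members)
    # Cluster quality: sum of squared errors E.
    sse = sum(dist(x, centroids[lab]) ** 2 for x, lab in zip(points, labels))
    return centroids, labels, sse


data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels, sse = k_means(data, k=2)
print(labels, round(sse, 3))
```

The result depends on k and on the initial centroids, which is why both appear in the list of factors to consider; running with several seeds and keeping the lowest E is a common workaround.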
When it fails!

• Clusters with very different sizes & with concave shapes



Hierarchical Methods 

1. Agglomerative versus Divisive Clustering 

• Hierarchical clustering -> group data objects into a hierarchy or "tree" of 
clusters 

• Agglomerative -> bottom-up (merge) composition 

  ◦ Each object starts in its own cluster

  ◦ The two closest clusters are merged into a bigger cluster

  ◦ Merging is repeated until a termination condition holds or a single cluster is formed
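The bottom-up merge loop can be sketched as follows. This is a minimal sketch under stated assumptions: the closeness of two clusters is taken to be the distance of their closest pair of objects (single linkage), the termination condition is a target number of clusters, and the names `single_link` and `agglomerative` are hypothetical.

```python
import math


def dist(x, y):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def single_link(c1, c2, points):
    """Cluster distance = distance of the closest pair (assumed linkage)."""
    return min(dist(points[i], points[j]) for i in c1 for j in c2)


def agglomerative(points, target_k=1):
    """Merge the two closest clusters until target_k clusters remain.

    Returns the final clusters (lists of object indices) and the merges
    in the order they were performed.
    """
    clusters = [[i] for i in range(len(points))]  # each object on its own
    merges = []
    while len(clusters) > target_k:
        # Find the closest pair of clusters.
        a, b = min(
            ((p, q) for p in range(len(clusters))
             for q in range(p + 1, len(clusters))),
            key=lambda pq: single_link(clusters[pq[0]], clusters[pq[1]], points),
        )
        merges.append((clusters[a], clusters[b]))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return clusters, merges


pts = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5), (10.0, 0.0)]
clusters, merges = agglomerative(pts, target_k=2)
print(clusters)
print(merges)
```

The recorded merge order is what a dendrogram would draw; a merge is never undone, which is why hierarchical methods cannot correct an erroneous merge.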




Minimal spanning tree! 


• Divisive -> top-down (split) composition

  ◦ All objects start in one big cluster

  ◦ Divide into subclusters

  ◦ Recursively divide subclusters into even smaller subclusters

  ◦ Terminate when each object has its own cluster or the objects in each cluster are similar "enough"



[Figure: divisive clustering (DIANA) of objects a, b, c, d, e over Steps 1 to 4; label: maximum distance]

• How to divide a cluster is a challenge! Heuristic approaches can be used

CHAMELEON: Multiphase Hierarchical Clustering Using Dynamic Modeling 

• Cluster similarity based on:

  ◦ Interconnectivity -> how well connected objects are within a cluster

  ◦ Closeness -> the proximity of clusters



1. Construct a sparse graph: 

• Vertices are data objects

• There exists an edge between two vertices {x, y} if x is among the k most similar objects to y -> k-nearest neighbor graph

• Edges are weighted to reflect similarity
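The sparse-graph construction can be sketched as below. Two choices here are assumptions made for illustration and are not taken from the slides: an edge {x, y} is added if either object is among the other's k nearest neighbors, and the similarity weight is `1 / (1 + distance)`. The function name `knn_graph` is hypothetical.

```python
import math


def dist(x, y):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def knn_graph(points, k):
    """Return {frozenset({i, j}): weight} for a k-nearest-neighbor graph.

    Vertices are object indices. The weight 1 / (1 + distance) is an
    assumed similarity: closer objects get heavier edges.
    """
    edges = {}
    for i, p in enumerate(points):
        # All other objects, ordered from most to least similar to p.
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: dist(p, points[j]))
        for j in others[:k]:
            edges[frozenset((i, j))] = 1.0 / (1.0 + dist(p, points[j]))
    return edges


pts = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
graph = knn_graph(pts, k=1)
for edge, weight in graph.items():
    print(sorted(edge), round(weight, 3))
```

Because each object contributes at most k edges, the graph stays sparse even for large data sets.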



2. Graph partitioning: 

• Minimize edge cut EC(C_i, C_j) -> minimize the weight of the edges that would be cut to split C into C_i & C_j -> measures absolute interconnectivity

3. Agglomerative clustering: 

• Measure relative interconnectivity RI(C_i, C_j)

• Measure relative closeness RC(C_i, C_j)

[Figure: CHAMELEON phases: data set -> construct a sparse graph -> partition the graph -> merge]

• Relative Interconnectivity RI(C_i, C_j) -> absolute interconnectivity between C_i, C_j normalized by the internal interconnectivity of C_i & C_j

RI(C_i, C_j) = \frac{|EC_{\{C_i, C_j\}}|}{\frac{1}{2}\left(|EC_{C_i}| + |EC_{C_j}|\right)}

• |EC_{\{C_i, C_j\}}| -> sum of weights of edges that connect C_i with C_j

• |EC_{C_i}| -> min sum of cut edges that partition C_i into two roughly equal parts

• Relative interconnectivity -> based on sums of edge weights



• Relative Closeness RC(C_i, C_j) -> absolute closeness between C_i, C_j normalized by the internal closeness of C_i & C_j

RC(C_i, C_j) = \frac{\bar{S}_{EC_{\{C_i, C_j\}}}}{\frac{|C_i|}{|C_i| + |C_j|}\bar{S}_{EC_{C_i}} + \frac{|C_j|}{|C_i| + |C_j|}\bar{S}_{EC_{C_j}}}

• \bar{S}_{EC_{\{C_i, C_j\}}} -> average weight of edges connecting vertices in C_i to vertices in C_j

• Relative closeness -> based on average edge weights

• \bar{S}_{EC_{C_i}} -> avg weight of edges belonging to the min-cut bisector of C_i



• Min-cut bisector -> |EC_{C_i}| and \bar{S}_{EC_{C_i}} are the weighted sum and the average weight of the edges that partition the cluster into roughly equal parts
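The two merge criteria reduce to simple arithmetic once the edge cuts are known. The sketch below assumes the graph is stored as a dictionary from `frozenset({u, v})` to edge weight, and it takes the internal min-cut bisector values of each cluster as given inputs, since computing a min-cut bisector is a separate graph-partitioning problem that is not attempted here. All function names and the toy weights are hypothetical.

```python
def edge_cut(edges, ci, cj):
    """Weights of the edges with one endpoint in ci and the other in cj."""
    return [w for e, w in edges.items()
            if len(e & ci) == 1 and len(e & cj) == 1]


def relative_interconnectivity(cut, ec_i, ec_j):
    """RI = |EC_{Ci,Cj}| / (0.5 * (|EC_Ci| + |EC_Cj|))."""
    return sum(cut) / (0.5 * (ec_i + ec_j))


def relative_closeness(cut, n_i, n_j, s_i, s_j):
    """RC = average cut weight / size-weighted average bisector weight."""
    s_cut = sum(cut) / len(cut)
    total = n_i + n_j
    return s_cut / (n_i / total * s_i + n_j / total * s_j)


# Toy weighted graph on six vertices, split into two clusters.
edges = {
    frozenset((0, 1)): 0.9, frozenset((1, 2)): 0.8,
    frozenset((3, 4)): 0.7, frozenset((4, 5)): 0.9,
    frozenset((2, 3)): 0.2, frozenset((1, 4)): 0.4,
}
ci, cj = {0, 1, 2}, {3, 4, 5}
cut = edge_cut(edges, ci, cj)

# Assumed internal bisectors: cutting edge (1, 2) in ci and (3, 4) in cj.
ri = relative_interconnectivity(cut, ec_i=0.8, ec_j=0.7)
rc = relative_closeness(cut, n_i=3, n_j=3, s_i=0.8, s_j=0.7)
print(round(ri, 3), round(rc, 3))
```

Values of RI and RC near or above 1 mean the connection between the two clusters is about as strong as the connections inside them, which makes the pair a good candidate for merging.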



Density-based Methods 

DBSCAN: Density-Based Clustering Based on Connected Regions with High Density 
• Find core objects (with dense neighborhoods)
• Connect core objects to form dense clusters
• User provides:

  ◦ ε-neighborhood of object o -> space within a radius ε centered at o

  ◦ Neighborhood density -> # objects in that neighborhood

  ◦ MinPts -> density threshold for a neighborhood

• Core object -> object whose ε-neighborhood contains at least MinPts objects




• An object p is directly density-reachable from another object q if and only if q is a core object and p is in the ε-neighborhood of q

• p is density-reachable from q if there is a chain of objects from q to p in which each object is directly density-reachable from the previous one

• Objects q & m are density-connected if there is an object o such that both q and m are density-reachable from o

[Figure: density-reachability and density-connectivity, given ε = 4 and MinPts = 5]
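The definitions above translate into a short expansion procedure. The following is a minimal sketch rather than a tuned implementation: it uses a brute-force neighborhood search, counts an object as part of its own ε-neighborhood, and labels noise as -1. These conventions and the names `region_query` and `dbscan` are assumptions for the example.

```python
import math


def dist(x, y):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def region_query(points, i, eps):
    """Indices of all objects in the eps-neighborhood of object i (incl. i)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per object: 0, 1, ... for clusters, -1 for noise."""
    NOISE, UNSEEN = -1, None
    labels = [UNSEEN] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNSEEN:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # not a core object (may become border later)
            continue
        cluster += 1  # i is a core object: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border object: reachable, not core
            if labels[j] is not UNSEEN:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is a core object: expand
                queue.extend(j_neighbors)
    return labels


pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11),
       (50, 50)]
print(dbscan(pts, eps=1.5, min_pts=3))
```

Objects that are density-reachable from the same core object end up with the same label, and objects reachable from no core object stay labeled as noise, which is how the method filters outliers.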



Evaluation of Clustering 
Assessing Clustering Tendency 

• Determines whether a given data set has a non-random structure 

• Hopkins Statistic -> Statistical tests for spatial randomness 

• Sample n points, p_1, ..., p_n, uniformly from the data space of D

• For each point p_i, find its nearest neighbor in D

x_i = \min_{v \in D} dist(p_i, v)

• Sample n points, q_1, ..., q_n, uniformly from D

• For each q_i, find the nearest neighbor of q_i in D - {q_i}

y_i = \min_{v \in D, v \neq q_i} dist(q_i, v)

• Calculate the Hopkins Statistic H:

H = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i + \sum_{i=1}^{n} y_i}

• If D is uniformly distributed, \sum_{i=1}^{n} x_i and \sum_{i=1}^{n} y_i are roughly equal, and H ≈ 0.5
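The Hopkins Statistic can be sketched as follows. One point is an assumption about how to read the sampling step: the p_i are drawn uniformly from the bounding box of D (random locations in the data space), while the q_i are actual objects drawn from D. The function name `hopkins`, the fixed `seed`, and the toy data are also assumptions.

```python
import math
import random


def dist(x, y):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def hopkins(data, n, seed=0):
    """H = sum(y) / (sum(x) + sum(y)) for n sampled points."""
    rng = random.Random(seed)
    dims = range(len(data[0]))
    lo = [min(p[d] for p in data) for d in dims]
    hi = [max(p[d] for p in data) for d in dims]
    # x_i: distance from a random location p_i to its nearest neighbor in D.
    xs = []
    for _ in range(n):
        p = tuple(rng.uniform(lo[d], hi[d]) for d in dims)
        xs.append(min(dist(p, v) for v in data))
    # y_i: distance from a sampled object q_i to its nearest neighbor
    # in D - {q_i}.
    ys = []
    for qi in rng.sample(range(len(data)), n):
        ys.append(min(dist(data[qi], v)
                      for j, v in enumerate(data) if j != qi))
    return sum(ys) / (sum(xs) + sum(ys))


# Two tight, well-separated groups: a clearly non-random structure.
clustered = ([(0.01 * i, 0.0) for i in range(10)]
             + [(100.0 + 0.01 * i, 100.0) for i in range(10)])
print(round(hopkins(clustered, n=5), 4))
```

With this form of H, strongly clustered data pushes the value toward 0, because objects sit much closer to their neighbors than random locations do, while uniformly spread data gives a value near 0.5.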



Measuring Clustering Quality - Extrinsic Methods 

Extrinsic methods -> compare clustering against ground truth (supervision) 


• Assign a score Q(C, C_g) to capture:

  ◦ Cluster homogeneity -> the purer the better; clusters represent separate class labels

  ◦ Cluster completeness -> an object with a class label belongs to the cluster representing that class label

  ◦ Rag bag -> objects that can't be merged into clusters belong to a rag bag; penalize putting a miscellaneous object in a pure cluster more than putting it in a rag bag

  ◦ Small cluster preservation -> splitting a small category is more harmful than splitting a large category
• Ex. BCubed precision and recall of every object in the dataset:

  ◦ Precision -> how many objects in the same cluster belong to the same category as the object

  ◦ Recall -> how many objects of the same category are assigned to the same cluster
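BCubed precision and recall can be computed per object and then averaged. The sketch below assumes that an object counts as sharing a cluster and a category with itself (some formulations exclude it), and that clusterings are given as two parallel lists; the function name `bcubed` is hypothetical.

```python
def bcubed(clusters, categories):
    """Return (precision, recall) averaged over all objects.

    clusters[i] is the cluster id of object i and categories[i] is its
    ground-truth category.
    """
    n = len(clusters)
    precision = recall = 0.0
    for i in range(n):
        same_cluster = [j for j in range(n) if clusters[j] == clusters[i]]
        same_category = [j for j in range(n)
                         if categories[j] == categories[i]]
        # Objects that share both the cluster and the category of object i.
        both = sum(1 for j in same_cluster if categories[j] == categories[i])
        precision += both / len(same_cluster)
        recall += both / len(same_category)
    return precision / n, recall / n


found = [0, 0, 0, 1, 1]
truth = ["a", "a", "b", "b", "b"]
p, r = bcubed(found, truth)
print(round(p, 3), round(r, 3))
```

A clustering that matches the ground truth exactly scores 1.0 on both measures; mixing categories in a cluster lowers precision, and splitting a category across clusters lowers recall.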

Intrinsic methods -> measure how well the clusters are separated 


• Ex. The silhouette coefficient -> difference between: 

  ◦ average distance between object o and all other objects in the cluster to which o belongs (captures cluster compactness) - smaller is better (more compact)

  ◦ minimum average distance from o to all clusters to which o does not belong (captures degree of separation from other clusters) - larger is better



• Compute the average silhouette coefficient for all objects in a cluster or over the whole dataset

  ◦ +ve -> clustering is good

  ◦ -ve -> clustering is bad
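The silhouette coefficient can be sketched as below. The slides only describe it as a difference between the two average distances; the normalization by max(a, b) used here is the usual textbook form and is an assumption about what the slides intend, as are the name `silhouette`, the score of 0 for singleton clusters, and the requirement of at least two clusters.

```python
import math


def dist(x, y):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def silhouette(points, labels):
    """Return the silhouette coefficient of every object.

    Assumes at least two clusters. For object o, a is the average
    distance to its own cluster, b the minimum average distance to any
    other cluster, and the score is (b - a) / max(a, b).
    """
    scores = []
    for i, o in enumerate(points):
        sums, counts = {}, {}
        for j, p in enumerate(points):
            if j == i:
                continue
            sums[labels[j]] = sums.get(labels[j], 0.0) + dist(o, p)
            counts[labels[j]] = counts.get(labels[j], 0) + 1
        own = labels[i]
        if own not in counts:  # singleton cluster: assumed score of 0
            scores.append(0.0)
            continue
        a = sums[own] / counts[own]
        b = min(sums[c] / counts[c] for c in counts if c != own)
        scores.append((b - a) / max(a, b))
    return scores


pts = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette(pts, [0, 0, 1, 1])
bad = silhouette(pts, [0, 1, 0, 1])
print(sum(good) / len(good) > 0, sum(bad) / len(bad) < 0)
```

Averaging the scores over a cluster or over the whole dataset gives the summary value: positive when objects are closer to their own cluster than to the nearest other one, negative when the assignment looks wrong.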



